When the Best Move Isn't Optimal: Q-learning with Exploration
Author
Abstract
The most popular delayed reinforcement learning technique, Q-learning (Watkins 1989), estimates the future reward expected from executing each action in every state. If these estimates are correct, then an agent can use them to select the action with maximal expected future reward in each state, and thus perform optimally. Watkins has proved that Q-learning produces an optimal policy (the function mapping states to actions) and that these estimates converge to the correct values given the optimal policy. However, the agent often does not follow the optimal policy faithfully: it must also explore the world, taking suboptimal actions in order to learn more about its environment. The “optimal” policy produced by Q-learning is no longer optimal if its prescriptions are only followed occasionally. In many situations (e.g., dynamic environments), the agent never stops exploring. In such domains Q-learning converges to policies which are suboptimal in the sense that there exists a different policy which would achieve higher reward when combined with exploration.
A bit of notation: Q(x, a) is the expected future reward received after taking action a in state x, and V(x) is the expected future reward received after starting in state x. Q̂ and V̂ denote the approximations kept by the algorithm. Each time the agent takes an action a, moving it from state x to state y and generating a reward r, Q-learning updates the approximations according to the following rules:
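The update rules themselves do not appear in this abstract as indexed here. What follows is the standard Watkins Q-learning update written in the abstract's notation; the learning rate α and discount factor γ are standard parameters assumed for this reconstruction rather than values quoted from the paper:

Q̂(x, a) ← (1 − α) Q̂(x, a) + α (r + γ V̂(y))
V̂(x) ← max over a of Q̂(x, a)

As a concrete illustration of the exploration issue the paper addresses, here is a minimal Python sketch, not taken from the paper, of tabular Q-learning with ε-greedy exploration. The environment interface (reset, step, n_actions) and all parameter values are hypothetical placeholders:

import random
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Q[(x, a)] approximates Q(x, a); v_hat(x) = max_a Q[(x, a)].
    Q = defaultdict(float)
    actions = list(range(env.n_actions))

    def v_hat(x):
        return max(Q[(x, a)] for a in actions)

    def greedy(x):
        # Action with maximal estimated future reward in state x.
        return max(actions, key=lambda a: Q[(x, a)])

    for _ in range(n_episodes):
        x = env.reset()
        done = False
        while not done:
            # Exploration: with probability epsilon take a random action,
            # otherwise follow the current greedy ("optimal") policy.
            a = random.choice(actions) if random.random() < epsilon else greedy(x)
            y, r, done = env.step(a)
            # Watkins-style update toward r + gamma * v_hat(y).
            target = r + (0.0 if done else gamma * v_hat(y))
            Q[(x, a)] += alpha * (target - Q[(x, a)])
            x = y
    return Q

With ε held above zero the agent never acts purely greedily, which is exactly the regime the abstract describes: the greedy fixed point of these updates need not be the best policy to behave with while exploration continues.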
Similar Resources
RTP-Q: A Reinforcement Learning System with Time Constraints Exploration Planning for Accelerating the Learning Rate
Reinforcement learning is an efficient method for solving Markov Decision Processes in which an agent improves its performance by using scalar reward values, giving it a high capability for reactive and adaptive behaviors. Q-learning is a representative reinforcement learning method which is guaranteed to obtain an optimal policy but needs numerous trials to achieve it. k-Certainty Exploration Learning Sy...
A Comparison of Reinforcement Learning Methods for Automatic Guided Vehicle Scheduling
Automatic Guided Vehicles (AGVs) are increasingly being used in manufacturing plants for transportation tasks. Optimal scheduling of AGVs is a difficult problem. A learning AGV is very attractive in a manufacturing plant since it is hard to manually optimize the scheduling algorithm for each new situation. In this paper we compare four reinforcement learning methods for scheduling AGVs. Q-learn...
An Advance Q Learning (AQL) Approach for Path Planning and Obstacle Avoidance of a Mobile Robot
The goal of this paper is to improve the performance of the well-known Q learning algorithm, a robust machine learning technique, to facilitate path planning in an environment. Until now, Q learning algorithms such as the Classical Q learning (CQL) algorithm and the Improved Q learning (IQL) algorithm have dealt with environments without obstacles, while in a real environment an agent has to face o...
Probabilistic Exploration in Planning while Learning
Sequential decision tasks with incomplete information are characterized by the exploration problem; namely, the trade-off between further exploration for learning more about the environment and immediate exploitation of the accrued information for decision-making. Within artificial intelligence, there has been an increasing interest in studying planning-while-learning algorithms for these ...
To Discount or Not to Discount in Reinforcement Learning: A Case Study Comparing R Learning and Q Learning
Most work in reinforcement learning (RL) is based on discounted techniques, such as Q learning, where long-term rewards are geometrically attenuated based on the delay in their occurrence. Schwartz recently proposed an undiscounted RL technique called R learning that optimizes average reward, and argued that it was a better metric than the discounted one optimized by Q learning. In this paper we...
Journal title:
Volume / Issue
Pages -
Publication date: 1994